
Section 1. Problem Domain Description

COVID-19, short for coronavirus disease 2019, is the illness caused by the SARS-CoV-2 virus, first identified in Wuhan, China in December 2019. Since then, the virus has spread rapidly across the world, leading the World Health Organization to declare a global pandemic. Millions of Americans have been infected, and hundreds of thousands have died from the disease, with those numbers continuing to grow each day. A global race to develop a vaccine in record time ensued, with over 100 candidates being tested across the globe. Although multiple vaccines have received emergency authorization in multiple nations, the situation continues to worsen as new mutant strains are identified, such as those found in the United Kingdom. In the United States, public health officials are struggling to convince the populace that the vaccines are safe and effective, leading to widespread anti-vaccine protests that seek to slow vaccination efforts, which only gives the virus more time to develop mutations that could defeat the current vaccine formulations.

Thus, analyzing data related to COVID-19 is worthwhile because it will help people understand the overall situation and severity of the pandemic and encourage them to adopt protective measures such as mask-wearing, social distancing, and vaccination. In addition, analyzing this data may expose differences between states in how well their regulations contain the virus, which may help state governments ensure that they are using only restrictions that truly work to contain this pathogen.

Section 2. Data Description

JHU CSSE COVID-19 Data

The COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University is compiled from sources including, but not limited to, the World Health Organization and the United States Centers for Disease Control and Prevention (a list of all data sources is provided in the README.md file of the repository). It provides case and death counts for each county in every state/U.S. territory for each day since the SARS-CoV-2 virus was first detected in Washington state in January of 2020. This data set is known for providing some of the most up-to-date information available, and many organizations cite it as trustworthy and reliable.

UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key X1.22.20
84001001 US USA 840 1001 Autauga Alabama US 32.53953 -86.64408 Autauga, Alabama, US 0
84001003 US USA 840 1003 Baldwin Alabama US 30.72775 -87.72207 Baldwin, Alabama, US 0
84001005 US USA 840 1005 Barbour Alabama US 31.86826 -85.38713 Barbour, Alabama, US 0
84001007 US USA 840 1007 Bibb Alabama US 32.99642 -87.12511 Bibb, Alabama, US 0
84001009 US USA 840 1009 Blount Alabama US 33.98211 -86.56791 Blount, Alabama, US 0
84001011 US USA 840 1011 Bullock Alabama US 32.10031 -85.71266 Bullock, Alabama, US 0
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key Population
84001001 US USA 840 1001 Autauga Alabama US 32.53953 -86.64408 Autauga, Alabama, US 55869
84001003 US USA 840 1003 Baldwin Alabama US 30.72775 -87.72207 Baldwin, Alabama, US 223234
84001005 US USA 840 1005 Barbour Alabama US 31.86826 -85.38713 Barbour, Alabama, US 24686
84001007 US USA 840 1007 Bibb Alabama US 32.99642 -87.12511 Bibb, Alabama, US 22394
84001009 US USA 840 1009 Blount Alabama US 33.98211 -86.56791 Blount, Alabama, US 57826
84001011 US USA 840 1011 Bullock Alabama US 32.10031 -85.71266 Bullock, Alabama, US 10101
Data Set Features of Note
  • Admin2: name of county/political subdivision of U.S. state/territory
  • Province_State: name of U.S. state/territory
  • Xmm.dd.yy: one feature per day since the SARS-CoV-2 virus was first detected in the United States, representing the case/death count of the county/political subdivision defined by the Admin2 feature; the name takes the format Xmm.dd.yy, where mm is the one- or two-digit month, dd is the one- or two-digit day of the month, and yy is the two-digit year without century
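
The Xmm.dd.yy naming scheme described above can be turned back into calendar dates before any time-based analysis. The following is a minimal Python sketch of that step; the helper name `parse_day_column` is hypothetical and is not part of the JHU repository or of this report's own code.

```python
from datetime import date, datetime


def parse_day_column(name: str) -> date:
    """Convert a day column name such as 'X1.22.20' into a date."""
    if not name.startswith("X"):
        raise ValueError(f"not a day column: {name!r}")
    # %m and %d accept one- or two-digit values; %y is the two-digit year.
    return datetime.strptime(name[1:], "%m.%d.%y").date()


first_day = parse_day_column("X1.22.20")
print(first_day.isoformat())  # 2020-01-22
```

Parsing the names once and storing real dates avoids sorting the columns as strings, where "X10.1.20" would sort before "X2.1.20".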

HIFLD Hospitals

The Homeland Infrastructure Foundation-Level Data Hospitals (HIFLD Hospitals) data set published by the United States Department of Homeland Security and compiled from sources from the United States Department of Health & Human Services and Centers for Disease Control and Prevention provides a list of all hospitals in the United States and their associated trauma level. It identifies how many hospitals and of what type exist in each state.

NAME STATE TYPE BEDS TRAUMA
CENTRAL VALLEY GENERAL HOSPITAL CA GENERAL ACUTE CARE 49 NA
LOS ROBLES HOSPITAL & MEDICAL CENTER - EAST CAMPUS CA GENERAL ACUTE CARE 62 NA
EAST LOS ANGELES DOCTORS HOSPITAL CA GENERAL ACUTE CARE 127 NA
SOUTHERN CALIFORNIA HOSPITAL AT HOLLYWOOD CA GENERAL ACUTE CARE 100 NA
KINDRED HOSPITAL BALDWIN PARK CA GENERAL ACUTE CARE 95 NA
LAKEWOOD REGIONAL MEDICAL CENTER CA GENERAL ACUTE CARE 172 NA
Data Set Features of Note
  • STATE: two-letter U.S.P.S. abbreviation of state name
  • TYPE: type of hospital; value can be "GENERAL ACUTE CARE", "CRITICAL ACCESS", "PSYCHIATRIC", "LONG TERM CARE", "REHABILITATION", "MILITARY", "SPECIAL", "CHILDREN", "WOMEN", or "CHRONIC DISEASE"
  • STATUS: current status of hospital; value either "OPEN" or "CLOSED"
  • LATITUDE: latitude of hospital
  • LONGITUDE: longitude of hospital
  • BEDS: number of beds available at hospital; value of -999 represents an unknown count of beds
  • TRAUMA: non-standard trauma center level identifier (definitions can be found in the HIFLD Trauma Levels Data Set); value of "NOT AVAILABLE" indicates the hospital is not classified as a trauma center
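
Because BEDS and TRAUMA both use sentinel values, they need to be converted to missing values before any summary statistic is computed. The sketch below shows one way to do this in Python; the function names are hypothetical, and the -999 row in the usage example is an illustrative placeholder rather than a real record.

```python
from typing import Iterable, Optional

UNKNOWN_BEDS = -999            # sentinel for an unknown bed count
NO_TRAUMA = "NOT AVAILABLE"    # sentinel for "not a trauma center"


def clean_beds(raw: int) -> Optional[int]:
    """Return None for the unknown-beds sentinel, otherwise the count."""
    return None if raw == UNKNOWN_BEDS else raw


def clean_trauma(raw: str) -> Optional[str]:
    """Return None when the hospital has no trauma designation."""
    value = raw.strip().upper()
    return None if value == NO_TRAUMA else value


def mean_known_beds(bed_counts: Iterable[int]) -> Optional[float]:
    """Average bed count, ignoring hospitals whose count is unknown."""
    known = [b for b in map(clean_beds, bed_counts) if b is not None]
    return sum(known) / len(known) if known else None


# 49, 62 and 127 appear in the sample rows above; -999 is a placeholder.
print(mean_known_beds([49, 62, -999, 127]))
```

Leaving -999 in place would drag the mean and minimum of BEDS far below their true values, which is why the summary later in this report lists those rows as NA.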

NYT Mask-Wearing Survey

The NYT Mask-Wearing Survey data set contains county-level estimates of mask usage in the US, derived from 250,000 survey responses. Each participant was asked “How often do you wear a mask in public when you expect to be within six feet of another person?” and given the choices of never, rarely, sometimes, frequently, or always. The survey was conducted from July 2 to July 14, 2020, and the data set was assembled by The New York Times and Dynata.

COUNTYFP NEVER RARELY SOMETIMES FREQUENTLY ALWAYS
1001 0.053 0.074 0.134 0.295 0.444
1003 0.083 0.059 0.098 0.323 0.436
1005 0.067 0.121 0.120 0.201 0.491
1007 0.020 0.034 0.096 0.278 0.572
1009 0.053 0.114 0.180 0.194 0.459
1011 0.031 0.040 0.144 0.286 0.500
  • The COUNTYFP column is the FIPS code for the county.
  • Each of the remaining columns is the estimated proportion of people in that county who responded with that option. For each county, these values add up to approximately one.

CDC COVID-19 Vaccinations in the United States

The COVID-19 Vaccinations in the United States data set contains the number of vaccine doses administered in each state. Data on COVID-19 vaccine doses administered in the United States are collected by vaccination providers and reported to the CDC through multiple sources, including jurisdictions, pharmacies, and federal entities, which use various reporting methods, including Immunization Information Systems, the Vaccine Administration Management System, and direct data submission.

State Total_Doses_Administered Doses_Administered_per_100k X18._Doses_Administered X18._Doses_Administered_per_100K
Alaska 239927 32797 238872 43308
Alabama 815108 16624 814893 21361
Arkansas 540192 17900 540003 23300
American Samoa 18816 33788 18600 42821
Arizona 1525794 20962 1524293 27034
Bureau of Prisons 52743 NA 52740 NA
  • Total_Doses_Administered: the total number of vaccine doses that have been given to people.
  • Doses_Administered_per_100k: the total number of vaccine doses given for every 100,000 people.
  • X18._Doses_Administered: the total number of vaccine doses that have been given to people aged 18 years and older.
  • X18._Doses_Administered_per_100K: the total number of vaccine doses given for every 100,000 people aged 18 years and older.


Stay At Home Order Effectiveness For Each State

The Infection Rates Before and After Stay at Home Orders Went Into Effect data set contains a list of states and the date on which each state’s first stay at home order took effect. It also contains infection rates for the days before and after these orders were enacted. Infection rates were calculated using daily COVID-19 case counts collected by the Johns Hopkins Center for Health Security.

State Order.date Infection.rate.and.confidence.interval..before.order. Infection.rate.and.confidence.interval..after.order.
Alabama 4/4/20 0.099 (0.088, 0.109) 0.042 (0.039, 0.045)
Alaska 3/28/20 0.11 (0.095, 0.126) 0.03 (0.027, 0.032)
Arizona 3/31/20 0.134 (0.124, 0.143) 0.03 (0.025, 0.036)
California 3/19/20 0.084 (0.077, 0.091) 0.055 (0.05, 0.06)
Colorado 3/26/20 0.11 (0.1, 0.121) 0.04 (0.035, 0.044)
Connecticut 3/23/20 0.154 (0.136, 0.172) 0.065 (0.059, 0.07)
  • State: the name of each U.S. state for which data was available.
  • Order.date: the date on which the first stay at home order was put into effect.
  • Infection.rate.and.confidence.interval..before.order.: the infection rate, and the confidence interval for this rate, for the day before the order went into effect.
  • Infection.rate.and.confidence.interval..after.order.: the infection rate, and the confidence interval for this rate, for the day after the order went into effect.
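
The two infection-rate columns store a rate and its confidence interval in a single string, and Order.date uses a two-digit year, so both need parsing before they can be summarized. The following Python sketch illustrates one way to do that; the function names and the regular expression are assumptions made for illustration, not the code used to produce this report.

```python
import re
from datetime import datetime

# Matches strings such as "0.099 (0.088, 0.109)".
RATE_CI = re.compile(r"^\s*([\d.]+)\s*\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)\s*$")


def parse_rate_ci(text):
    """Split 'rate (low, high)' into a (rate, low, high) tuple of floats."""
    match = RATE_CI.match(text)
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    rate, low, high = (float(group) for group in match.groups())
    return rate, low, high


def parse_order_date(text):
    """Parse dates such as '4/4/20'; %y reads the two-digit year as 2020."""
    return datetime.strptime(text, "%m/%d/%y").date()


# Alabama row from the sample table above.
before, _, _ = parse_rate_ci("0.099 (0.088, 0.109)")
after, _, _ = parse_rate_ci("0.042 (0.039, 0.045)")
change = round(before - after, 3)
print(parse_order_date("4/4/20"), change)
```

Using the two-digit-year directive matters here: reading "20" as a full year is what produces dates in the year 0020 instead of 2020.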


Subsection 2.2 Summary Analysis

JHU CSSE COVID-19 Data

Between 2020-01-22 and 2021-02-24, 28,336,097 total cases of COVID-19 were detected in the United States and 505,890 total deaths were ruled as being caused by COVID-19.

date total_cases total_deaths
Min. :2020-01-22 Min. : 1 Min. : 0
1st Qu.:2020-04-30 1st Qu.: 1107214 1st Qu.: 67774
Median :2020-08-08 Median : 5022981 Median :163216
Mean :2020-08-08 Mean : 7786083 Mean :177519
3rd Qu.:2020-11-16 3rd Qu.:11337674 3rd Qu.:249572
Max. :2021-02-24 Max. :28336097 Max. :505890

As seen in the distributions of cases and deaths by state, California and Texas both appear as outliers, with higher numbers of both cases and deaths. However, the large populations of these states offer a possible explanation for the higher numbers. Additionally, epidemiological data suggest that more infectious and transmissible variants of SARS-CoV-2 may be to blame for the high number of cases in these states.


HIFLD Hospitals

As seen in the above visualizations of the geographic distributions of hospitals and trauma centers in the United States, health care institutions tend to be located around population centers. The distributions also show that larger states with larger populations have more hospitals and trauma centers, and are more likely to have lower-numbered (more advanced) trauma centers. Additionally, lower-numbered trauma centers have, on average, more beds for patients than facilities with a higher-numbered trauma level.

BEDS
Min. : 2.0
1st Qu.: 29.0
Median : 89.0
Mean : 158.8
3rd Qu.: 221.8
Max. :1592.0
NA’s :180

As seen in the box plot, there are quite a few outliers when it comes to the distribution of beds among trauma center levels. This is likely due to the different populations of different regions, as facilities in more highly-populated areas will need more beds for patients than those in rural areas. It is likely that trauma centers are created based not on population, but rather, geographic distance to another facility able to provide the same level of care.
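
The report does not state which rule was used to flag outliers. Assuming the conventional box-plot rule (points more than 1.5 times the interquartile range beyond the quartiles), the fences can be sketched in Python as follows. The function name is hypothetical, and the example uses the overall BEDS quartiles from the summary above rather than per-trauma-level quartiles, which the box plot itself would use.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) box-plot fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Overall BEDS quartiles from the summary table: Q1 = 29.0, Q3 = 221.8.
lower, upper = tukey_fences(29.0, 221.8)
print(round(lower, 1), round(upper, 1))  # -260.2 511.0

# The largest facility (1592 beds) lies above the upper fence.
print(1592.0 > upper)  # True
```

Under this assumed rule, any hospital with more than roughly 511 beds would be drawn as an outlier when all facilities are pooled, and a negative lower fence means no facility can be a low outlier.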


NYT Mask-Wearing Survey

Grouped by counties, an average of 51% of the responses are “Always,” and an average of 8% of the responses are “Never.” For a single county, the values for each response are supposed to sum to one. In reality, the values are rounded to three decimal places, so the sum for each county ranges from 0.998 to 1.002.

NEVER RARELY SOMETIMES FREQUENTLY ALWAYS sum
Min. :0.00000 Min. :0.00000 Min. :0.0010 Min. :0.0290 Min. :0.1150 Min. :0.998
1st Qu.:0.03400 1st Qu.:0.04000 1st Qu.:0.0790 1st Qu.:0.1640 1st Qu.:0.3932 1st Qu.:1.000
Median :0.06800 Median :0.07300 Median :0.1150 Median :0.2040 Median :0.4970 Median :1.000
Mean :0.07994 Mean :0.08292 Mean :0.1213 Mean :0.2077 Mean :0.5081 Mean :1.000
3rd Qu.:0.11300 3rd Qu.:0.11500 3rd Qu.:0.1560 3rd Qu.:0.2470 3rd Qu.:0.6138 3rd Qu.:1.000
Max. :0.43200 Max. :0.38400 Max. :0.4220 Max. :0.5490 Max. :0.8890 Max. :1.002
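
The 0.998 to 1.002 range of the row sums is consistent with rounding: each of the five values is rounded to three decimal places, so each can be off by at most 0.0005 and the sum by at most 0.0025. A minimal Python sketch of that consistency check, using rows from the sample table earlier in this report (the function name is hypothetical):

```python
# Five values, each rounded to 3 decimals: worst-case error is 5 * 0.0005.
ROUNDING_TOLERANCE = 5 * 0.0005


def row_sum_ok(shares, tolerance=ROUNDING_TOLERANCE):
    """Check that a county's response shares sum to one within rounding."""
    return abs(sum(shares) - 1.0) <= tolerance + 1e-12


# NEVER, RARELY, SOMETIMES, FREQUENTLY, ALWAYS for three sample counties.
rows = {
    "1001": (0.053, 0.074, 0.134, 0.295, 0.444),
    "1003": (0.083, 0.059, 0.098, 0.323, 0.436),
    "1005": (0.067, 0.121, 0.120, 0.201, 0.491),
}

for county, shares in rows.items():
    print(county, round(sum(shares), 3), row_sum_ok(shares))
```

A row that failed this check would point to a data-entry or parsing problem rather than to rounding.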

There do not seem to be any significant outliers. This is probably because there were 250,000 survey responses for a survey with only five options, so a county would need an unusually different mix of responses to become an outlier. There is also less chance for outliers because the data is grouped by county and expressed as proportions, forcing the columns of each row to sum to one. There are no NA values, and the data set appears to cover almost every county.


CDC COVID-19 Vaccinations in the United States

By February 22, 68,150,728 people in the US had been vaccinated. Grouped by state, an average of 21,242 per 100,000 people (21.2415%) had been given doses. The number of doses administered per 100,000 people ranges from 11,767 to 39,499.

Total_Doses_Administered Doses_Administered_per_100k X18._Doses_Administered X18._Doses_Administered_per_100K
Min. : 7073 Min. :11767 Min. : 7073 Min. :15081
1st Qu.: 241471 1st Qu.:18891 1st Qu.: 240832 1st Qu.:24127
Median : 614928 Median :19881 Median : 614420 Median :25428
Mean :1097822 Mean :21231 Mean :1096961 Mean :27224
3rd Qu.:1396224 3rd Qu.:22824 3rd Qu.:1395704 3rd Qu.:28548
Max. :7728120 Max. :39499 Max. :7724412 Max. :50641

The most significant outlier in the data set is the total number of doses administered in California. One possible reason is that the overall education level in the state is high; another is that California’s population base is large, so a great number of people are receiving the vaccine.


Stay At Home Order Effectiveness In Different U.S. States

Between 2020-03-19 and 2020-04-07, {r num_orders} different states instituted stay at home orders. The average decrease in COVID-19 infection rates following stay at home orders was 0.07049, with the maximum decrease being 0.143 and the minimum being 0.018.

Order.date infection.rate.before.order infection.rate.after.order infection.rate.change
Min. :2020-03-19 Min. :0.0720 Min. :0.01900 Min. :0.01800
1st Qu.:2020-03-24 1st Qu.:0.0985 1st Qu.:0.03400 1st Qu.:0.05050
Median :2020-03-27 Median :0.1100 Median :0.04200 Median :0.07000
Mean :2020-03-27 Mean :0.1147 Mean :0.04426 Mean :0.07049
3rd Qu.:2020-04-01 3rd Qu.:0.1240 3rd Qu.:0.05600 3rd Qu.:0.08650
Max. :2020-04-07 Max. :0.1970 Max. :0.06600 Max. :0.14300

There were a few outliers in the infection rates before states imposed stay at home orders, as well as in the difference between infection rates before and after the orders were imposed. This is probably because the amount of time between the two measurements was inconsistent and because population density differs between states. The outliers for infection rates before orders were imposed occurred in Alabama, Alaska, and Arizona, with values of 0.072, 0.079, and 0.081, respectively. The outlier for infection rate change occurred in West Virginia, with a value of 0.018.


Section 3: Specific Question Analysis

Do the number and types of hospitals affect the death rate of COVID-19, and if so, how?

In every state, there are many different hospitals of many different sizes with many different capabilities. With this knowledge, the question can be asked: do the number and types of hospitals affect the death rate of COVID-19, and if so, how? In this question, the “type” of a hospital refers to the trauma center level (potentially) assigned to it based on its capability to handle trauma patients, as defined by the American College of Surgeons. Trauma centers are assigned a level from I to V, with level I trauma centers having the most advanced capabilities and surgeons and specialists available at any time, whereas level V trauma centers are capable of diagnosing and stabilizing trauma patients long enough for them to survive transfer to a lower-numbered (more capable) trauma center. Additionally, the death rate of COVID-19 is a cause-specific death rate, meaning it measures the frequency of death from a specific cause in a defined population over a specified interval. In this instance, it is measured in deaths per 100,000 members of the population.

To determine this, the number of hospitals of each type in the 50 states and other U.S. territories was compared to the death rate measured in each location since the SARS-CoV-2 virus was first detected in the United States. The HIFLD Hospitals data was used along with the JHU CSSE COVID-19 Data. The hospital data was filtered to include only relevant hospitals (i.e., general acute care hospitals rather than psychiatric or rehabilitation hospitals), and a standardized ACS trauma level was applied to relevant observations, as different states use different methods of denoting trauma levels. This was then aggregated by state to produce a count of the different types of hospitals in each state and territory, which can be seen in the following table. Death counts over time from the JHU CSSE COVID-19 Data were then imported, and the resulting data frame was transformed into long form to provide a total death count for each state and territory for each day since the virus first appeared in the U.S. The most recent day’s death totals were selected and combined with each region’s population data so that a death rate could be calculated. This was then joined with the existing hospital counts for each state and plotted on a scatter plot faceted by hospital type. A linear regression trend line was then applied to each plot, as shown in the plots below.

State Non-trauma Hospitals Level I Level II Level III Level IV Level V
AK 8 0 3 0 18 0
AL 51 4 2 52 0 0
AR 34 1 3 17 35 0
AS 1 0 0 0 0 0
AZ 57 11 0 8 27 0
CA 392 15 35 13 5 0
CO 22 3 11 26 32 0
CT 21 3 7 1 0 0
DC 6 3 0 0 0 0
DE 3 1 0 6 0 0
FL 230 10 23 0 0 0
GA 137 5 9 8 8 0
GU 3 0 0 0 0 0
HI 19 1 1 2 0 0
IA 15 2 4 19 90 0
ID 42 0 3 1 0 0
IL 143 17 40 0 0 0
IN 116 3 6 13 0 0
KS 106 2 2 5 35 0
KY 87 2 1 4 12 0
LA 172 2 4 3 0 0
MA 71 11 1 7 0 0
MD 49 1 4 3 0 0
ME 31 1 2 0 0 0
MI 132 8 23 7 0 0
MN 30 4 5 19 77 0
MO 70 15 20 27 3 0
MP 1 0 0 0 0 0
MS 27 1 3 15 61 0
MT 25 0 4 4 8 24
NC 119 6 3 8 0 0
ND 46 1 5 0 0 0
NE 49 1 3 5 37 0
NH 20 1 2 6 1 0
NJ 74 4 6 0 0 0
NM 36 1 0 6 6 0
NV 50 1 2 2 0 0
NY 171 20 13 10 0 0
OH 168 12 10 20 0 0
OK 30 2 2 26 73 0
OR 19 2 6 10 27 0
PA 176 18 12 2 1 0
PR 61 0 0 0 0 0
PW 1 0 0 0 0 0
RI 13 1 0 0 0 0
SC 79 5 1 1 0 0
SD 11 0 3 1 7 38
TN 121 7 2 7 0 0
TX 268 16 21 54 195 0
UT 32 2 3 5 11 3
VA 87 5 7 5 0 0
VI 2 0 0 0 0 0
VT 14 1 0 0 0 0
WA 27 1 8 22 35 14
WI 50 3 9 33 50 0
WV 28 2 3 3 24 0
WY 5 0 2 4 11 9
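
The data preparation described before the table (reshaping the wide death counts into long form, keeping the most recent day, computing a rate per 100,000, and joining with hospital counts) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the report's own code: the function names are hypothetical, the death counts are made-up placeholders, and state abbreviations are assumed to have already been mapped to full state names.

```python
from collections import defaultdict
from datetime import datetime

# County rows in the wide JHU layout. Populations come from the sample table
# in Section 2; the death counts are MADE-UP placeholders for illustration.
wide_rows = [
    {"Province_State": "Alabama", "Population": 55869,
     "X2.23.21": 10, "X2.24.21": 12},
    {"Province_State": "Alabama", "Population": 223234,
     "X2.23.21": 30, "X2.24.21": 33},
]

# Level I and Level II counts for Alabama (AL) from the table above.
hospital_counts = {"Alabama": {"Level I": 4, "Level II": 2}}


def to_long(rows):
    """Reshape wide rows into (state, day, deaths) records."""
    records = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("X"):
                day = datetime.strptime(key[1:], "%m.%d.%y").date()
                records.append((row["Province_State"], day, value))
    return records


def death_rate_per_100k(rows):
    """Deaths per 100,000 people by state on the most recent day."""
    records = to_long(rows)
    latest = max(day for _, day, _ in records)
    deaths = defaultdict(int)
    for state, day, count in records:
        if day == latest:
            deaths[state] += count
    population = defaultdict(int)
    for row in rows:
        population[row["Province_State"]] += row["Population"]
    return {state: deaths[state] / population[state] * 100_000
            for state in deaths if population[state] > 0}


rates = death_rate_per_100k(wide_rows)
# Join the rate with the hospital counts for each state.
joined = {state: {"death_rate_per_100k": rate, **hospital_counts.get(state, {})}
          for state, rate in rates.items()}
print(joined)
```

Summing county populations and county deaths separately before dividing gives the state-level rate; averaging county-level rates instead would weight small counties too heavily.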

Based on the results of the analysis, there does not appear to be a significant relationship between the number or types of hospitals in a state or territory and the death rate from COVID-19. This is evidenced largely by the six faceted scatter plots: in each plot, the trend line clearly does not depict a significant correlation between the facility count and the death rate. Interestingly, and contrary to my expectation, three of the six trend lines showed slight increases in the death rate with increases in the number of hospitals, specifically for level I, level II, and non-trauma hospitals. Level V trauma centers showed a slight downward trend, whereas level III and level IV centers appeared to have no correlation at all with the death rate.

One of the most interesting things that this might suggest is the importance of wearing a mask and practicing proper social distancing. As seen in the geographic distribution of trauma centers, lower-numbered (levels I and II) trauma centers and non-trauma hospitals tend to be grouped near population hotspots in large cities such as Los Angeles, Houston, Chicago, and New York, to name a few. This suggests that the death rate depends more on the ability of the virus to spread among individuals, which is greater in these large urban areas. This would support what public health officials have been saying throughout the pandemic: it is critical for everyone, but especially those who often come into contact with people outside of their household, to wash their hands, wear a mask, and keep their distance.
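
For reference, a linear trend line of the kind described above is an ordinary least-squares fit, and its R² value indicates how much of the variation in death rate the facility count explains. The Python sketch below shows the calculation; the function name is hypothetical and the data points are synthetic, chosen only to demonstrate the arithmetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared


# SYNTHETIC points (not from the report): facility counts vs. death rates.
slope, intercept, r_squared = fit_line([0, 1, 2], [1.0, 3.0, 5.0])
print(slope, intercept, r_squared)  # 2.0 1.0 1.0
```

A slope close to zero together with a small R² is what "no significant relationship" looks like numerically: the line is nearly flat and explains little of the scatter.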

Does the percentage of the population vaccinated in each state affect the case rate? If so, how?

The case rate here is the ratio of new cases to total population for each state on the most recent date. The percentage of the population vaccinated is the percentage of the population that has been given a vaccine.

To do the analysis, I first calculated the number of new cases in each state on the most recent date. I then calculated the case rate as the ratio of new cases to the population of each state. Lastly, I compared the case rate with the percentage of the population vaccinated to look for a potential relationship.
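
As a small worked illustration of the calculation just described, the Python sketch below derives new cases from the last two values of a cumulative series and divides by population. It assumes the daily columns hold cumulative counts; the function names are hypothetical and the numbers are made up for the example.

```python
def new_cases(cumulative):
    """New cases on the most recent day of a cumulative series."""
    if len(cumulative) < 2:
        raise ValueError("need at least two days of cumulative counts")
    return cumulative[-1] - cumulative[-2]


def case_rate(cumulative, population):
    """New cases on the most recent day divided by total population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return new_cases(cumulative) / population


# MADE-UP numbers: 20 new cases in a population of 100,000.
print(case_rate([1000, 1020], 100_000))  # 0.0002
```

Because this rate uses a single day of new cases, it is sensitive to reporting delays and weekend effects; a multi-day average would be a smoother alternative.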

State Most Recent Ratio of Vaccination Population Case Rate at Most Recent Date
New Mexico 0.29211 0.0002119
South Dakota 0.26816 0.0002922
West Virginia 0.26261 0.0001375
North Dakota 0.25795 0.0001493
Connecticut 0.25044 0.0004202
Wyoming 0.23527 0.0000757
Vermont 0.23091 0.0001235
Oklahoma 0.22911 0.0002005
Montana 0.22738 0.0001862
Maine 0.22275 0.0001211
Massachusetts 0.21764 0.0003041
Wisconsin 0.21634 0.0001437
Colorado 0.21419 0.0001982
Arizona 0.20962 0.0001742
Minnesota 0.20751 0.0001321
Virginia 0.20689 0.0002216
Florida 0.20514 0.0003248
Oregon 0.20508 0.0000972
Nebraska 0.20304 0.0001947
North Carolina 0.20196 0.0003127
New Hampshire 0.20171 0.0002441
Rhode Island 0.19980 0.0004296
Michigan 0.19881 0.0001559
New Jersey 0.19804 0.0003515
Washington 0.19614 0.0001118
California 0.19559 0.0001452
Louisiana 0.19518 0.0001895
Iowa 0.19490 0.0002443
Nevada 0.19428 0.0001620
New York 0.19393 0.0003181
Indiana 0.19293 0.0001478
Delaware 0.19165 0.0002807
Ohio 0.19067 0.0001572
Illinois 0.19066 0.0001598
Utah 0.19060 0.0002453
Kentucky 0.19058 0.0002888
Idaho 0.19050 0.0002274
Maryland 0.18732 0.0001421
Pennsylvania 0.18599 0.0002178
South Carolina 0.18279 0.0004040
Missouri 0.18101 0.0001002
Arkansas 0.17900 0.0002647
Kansas 0.17759 0.0003764
Georgia 0.17711 0.0002992
Mississippi 0.16990 0.0002255
Tennessee 0.16691 0.0002349
Alabama 0.16624 0.0002527
Texas 0.16555 0.0002609

Both the quantitative and graphical results show only a slight correlation between the Ratio of Vaccination Population and the Case Rate in each state on the most recent date.

This result was unexpected to me. I had originally been sure that the correlation would be strong, on the assumption that as the percentage of the population vaccinated goes up, the case rate goes down. Possible reasons for this unexpected result are:

  • Lack of data, because CDC vaccination data over time by state is not available to the public. Only vaccination data for each state on the most recent date, and data over time for the entire US, are available.
  • Other than vaccination, many other factors influence the case rate, such as different quarantine policies across states, different attitudes towards masks, and different population densities.
  • The effect of vaccination may take longer to show up in the case rate, since relatively few people have received the vaccine so far and vaccines have been in use for less than three months.
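
One common way to quantify such a relationship is the Pearson correlation coefficient; the report does not say which measure was used, so the Python sketch below is an assumption made for illustration. It uses only the first five rows of the table above, so its result is not the correlation for the full data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# First five rows of the table above (New Mexico through Connecticut).
vaccinated = [0.29211, 0.26816, 0.26261, 0.25795, 0.25044]
case_rates = [0.0002119, 0.0002922, 0.0001375, 0.0001493, 0.0004202]
r = pearson_r(vaccinated, case_rates)
print(round(r, 3))
```

Values of r near zero indicate a weak linear relationship, and a negative r would match the expectation that higher vaccination goes with a lower case rate.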